Creating a tagset, lexicon and guesser for a French tagger

Authors

  • Jean-Pierre Chanod
  • Pasi Tapanainen
Abstract

We earlier described two taggers for French, a statistical one and a constraint-based one. The two taggers share the same tokeniser and morphological analyser. In this paper, we describe the aspects of this work concerned with the definition of the tagset, the building of the lexicon (derived from an existing two-level morphological analyser), and the definition of a lexical transducer for guessing unknown words.

1 Background

We earlier described two taggers for French: a statistical one with an accuracy of 95–97 % and a constraint-based one with 97–99 % (see Chanod and Tapanainen, 1994; Chanod and Tapanainen, 1995). The disambiguation has already been described; here we discuss the other stages of the process, namely defining the tagset, transforming an existing lexicon into a new one, and guessing the words that do not appear in the lexicon. Our lexicon is based on a finite-state transducer lexicon (Karttunen et al., 1992).

We describe in this section the criteria for selecting the tagset. The following is based on what we found useful while developing the taggers.

2.1 The size of the tagset

Our basic French morphological analyser was not originally designed for a (statistical) tagger, and the number of different tag combinations it produces is quite high. The tagset itself contains only 88 tags. But because a word is typically associated with a sequence of tags, the number of different combinations is higher: 353 possible sequences for single French words. If we also consider words joined with clitics, the number of different combinations is much higher, namely 6525. A big tagset does not cause trouble for a constraint-based tagger, because one can refer to a combination of tags as easily as to a single tag. For a statistical tagger, however, a big tagset may be a major problem.

We therefore used two principles for forming the tagset: (1) the tagset should not be big and (2) the tagset should not introduce distinctions that cannot be resolved at this level of analysis.

2.2 Verb tense and mood

Since distinctions that cannot be resolved at this level of analysis should be avoided, we do not keep information about the tense of the verbs. Some of this information can be recovered later by performing another lexicon lookup after the analysis. Thus, if the verb tense is not ambiguous, we have not lost any information and, even if it is, a part-of-speech tagger …
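The two tagset principles and the later tense lookup can be illustrated with a small Python sketch. It is not the authors' implementation: the toy lexicon, the tag names (`VERB`, `PRES`, `P3`, ...) and the helper functions are all hypothetical, and the real system uses a finite-state transducer lexicon rather than a dictionary. The sketch only shows how dropping tense features shrinks the set of distinct tag sequences, and how a second lookup may recover the tense when it is unambiguous.

```python
# Toy lexicon (hypothetical, not exhaustive): word form -> fine-grained tag sequences.
ANALYSES = {
    "dit": [("VERB", "PRES", "SG", "P3"), ("VERB", "PAST", "SG", "P3")],
    "mangeait": [("VERB", "IMPF", "SG", "P3")],
    "porte": [("NOUN", "FEM", "SG"), ("VERB", "PRES", "SG", "P3")],
}

# Features assumed to be unresolvable at the tagging stage (illustrative names).
TENSE = {"PRES", "PAST", "IMPF"}


def collapse(seq):
    """Drop tense features and join the remaining tags into one coarse tag."""
    return "-".join(tag for tag in seq if tag not in TENSE)


def distinct_sequences(analyses):
    """Set of all fine-grained tag sequences found in the lexicon."""
    return {seq for seqs in analyses.values() for seq in seqs}


def coarse_tags(analyses):
    """Map each word form to its sorted set of coarse tags."""
    return {w: sorted({collapse(s) for s in seqs}) for w, seqs in analyses.items()}


def recover_tense(word, coarse_tag, analyses=ANALYSES):
    """Second lexicon lookup: tenses compatible with the chosen coarse tag."""
    return {
        tag
        for seq in analyses.get(word, [])
        if collapse(seq) == coarse_tag
        for tag in seq
        if tag in TENSE
    }


if __name__ == "__main__":
    fine = distinct_sequences(ANALYSES)
    coarse = {collapse(s) for s in fine}
    print(len(fine), "fine-grained sequences,", len(coarse), "coarse tags")
    print(coarse_tags(ANALYSES))
    print(recover_tense("mangeait", "VERB-SG-P3"))  # a single tense remains
    print(recover_tense("dit", "VERB-SG-P3"))       # tense stays ambiguous
```

In this toy setting, the form "dit" keeps a single coarse verb tag even though its tense is ambiguous, so the tagger is never asked to make a distinction it cannot resolve.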

Similar resources

Comparing a statistical and a constraint-based method

In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems an...

Part-of-speech tagging for Swedish

This paper describes the work with a part-of-speech tagger for Swedish. The tagger used in the work was originally designed by Brill (1992) and may be adapted to different languages using annotated training corpora. The training corpus in this case is very small and may be the reason why the tagger is not very accurate in its original form. Extending the lexicon using different methods has enha...

The Development of a Morphosyntactic Tagset for Afrikaans and its Use with Statistical Tagging

In this paper, we present a morphosyntactic tagset for Afrikaans based on the guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES). We compare our slim yet expressive tagset, MAATS (Morphosyntactic AfrikAans TagSet), with an existing one which primarily focuses on a detailed morphosyntactic and semantic description of word forms. MAATS will primarily be u...

Part-of-Speech Tagging of Dutch with MBT, a Memory-Based Tagger Generator

We present a part of speech tagger (morphosyntactic disambiguator) for Dutch, constructed by means of the Memory-Based Tagger generation method. In this approach, inductive learning methods are used to derive a tagger, lexicon and unknown word category guesser fully automatically from a tagged example corpus. Advantages of the approach are (i) fast tagger development time without linguistic eng...

Tagging Urdu Text with Parts of Speech: A Tagger Comparison

In this paper, four state-of-the-art probabilistic taggers, i.e. the TnT tagger, TreeTagger, RF tagger and SVM tool, are applied to the Urdu language. For the purpose of the experiment, a syntactic tagset is proposed. A training corpus of 100,000 tokens is used to train the models. Using the lexicon extracted from the training corpus, SVM tool shows the best accuracy of 94.15%. After providing a separat...

Journal title:
  • CoRR

Volume cmp-lg/9503004  Issue

Pages  -

Publication date 1995